Overview and Motivation

A hospital readmission is when a patient who is discharged from the hospital, gets re-admitted again within a certain period of time. The rates of hosptial readmission are now considered an indicator of hospital quality and also contributes to the high cost of healthcare in America. In order to adress this issue, the Centers for Medicare & Medicaid Services has established the Hospital Readmissions Reduction Program (HRRP) with the goal of improving the quality of care for patients and reducing healthcare spending. The primary way in which they do this apply bying payment penalties to hospitals that have more than expected readmission rates for certain conditions. While diabetes has not yet been added to HRBP’s list of conditions, Diabetes is the is the [condition with the 3rd most all-cause, 30-day readmissions for Medicaid patients, and in 2011 American hospitals spent over $41 billion on diabetic patients who got readmitted within 30 days of discharge] (https://www.hcup-us.ahrq.gov/reports/statbriefs/sb172-Conditions-Readmissions-Payer.jsp).

The ability to determine which risk factors lead to higher readmission in such patients, and accordingly being able to predict which patients will get readmitted, could help save hospitals billions of dollars while also improving thef quality of care. With this goal in mind, we set out to answer these two questions:

1. How well can we predict 30 day hospital readmission of diabetes patients using medical claims data?
2. What factors are the most important in predicting hosptial readmissions in a diabetic patient? 

Dataset

The dataset we will be working with represents 10 years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks. It includes over 50 features representing patient and hospital outcomes. Information was extracted from the database for encounters that satisfied the following criteria.

1. It is an inpatient encounter (a hospital admission).
2. It is a diabetic encounter, that is, one during which any kind of diabetes was entered to the system as a diagnosis.
3. The length of stay was at least 1 day and at most 14 days.
4. Laboratory tests were perfocrmed during the encounter.
5. Medications were administered during the encounter.

The data contains such attributes as patient number, race, gender, age, admission type, time in hospital, medical specialty of admitting physician, number of lab test performed, HbA1c test reslt, diagnosis, number of medication, diabetic medications, number of outpatient, inpatient, and emergency visits in the year before the hospitalization, etc.

The dataset is publicly avaliable from the UCI Machine Learning Repository.

Read in the dataset

To get us started, we read in the dataset and remove any patients that appear more than once in the dataset in order to avoid any bias that might come with this overreprsentation of certain patients. This resulted in dataset being reduced to approximately 70,000 encounters:

# Reading in data
data <- read.csv("dataset/diabetic_data.csv")

# remove duplicates
data <- data[!duplicated(data$patient_nbr), ]

Data Preparation and Preprocessing

Before moving on to do some data cleaning, let’s observe the descriptive statistics of each column in the training dataset.

The skimr package provides a nice solution to show key descriptive stats for each column.

data[data == "?"] <- NA
skimmed <- skim_to_wide(data)
knitr::kable(skimmed[, c(1:3, 5:6, 9:10, 16)])
type variable missing n n_unique mean sd hist
factor A1Cresult 0 71518 4 NA NA NA
factor acarbose 0 71518 3 NA NA NA
factor acetohexamide 0 71518 2 NA NA NA
factor age 0 71518 10 NA NA NA
factor change 0 71518 2 NA NA NA
factor chlorpropamide 0 71518 4 NA NA NA
factor citoglipton 0 71518 1 NA NA NA
factor diabetesMed 0 71518 2 NA NA NA
factor diag_1 11 71518 696 NA NA NA
factor diag_2 294 71518 725 NA NA NA
factor diag_3 1225 71518 758 NA NA NA
factor examide 0 71518 1 NA NA NA
factor gender 0 71518 3 NA NA NA
factor glimepiride 0 71518 4 NA NA NA
factor glimepiride.pioglitazone 0 71518 1 NA NA NA
factor glipizide 0 71518 4 NA NA NA
factor glipizide.metformin 0 71518 2 NA NA NA
factor glyburide 0 71518 4 NA NA NA
factor glyburide.metformin 0 71518 4 NA NA NA
factor insulin 0 71518 4 NA NA NA
factor max_glu_serum 0 71518 4 NA NA NA
factor medical_specialty 34477 71518 70 NA NA NA
factor metformin 0 71518 4 NA NA NA
factor metformin.pioglitazone 0 71518 2 NA NA NA
factor metformin.rosiglitazone 0 71518 2 NA NA NA
factor miglitol 0 71518 4 NA NA NA
factor nateglinide 0 71518 4 NA NA NA
factor payer_code 31043 71518 17 NA NA NA
factor pioglitazone 0 71518 4 NA NA NA
factor race 1948 71518 5 NA NA NA
factor readmitted 0 71518 3 NA NA NA
factor repaglinide 0 71518 4 NA NA NA
factor rosiglitazone 0 71518 4 NA NA NA
factor tolazamide 0 71518 2 NA NA NA
factor tolbutamide 0 71518 2 NA NA NA
factor troglitazone 0 71518 2 NA NA NA
factor weight 68665 71518 9 NA NA NA
integer admission_source_id 0 71518 NA 5.66 4.16 ▅▇▁▁▁▁▁▁
integer admission_type_id 0 71518 NA 2.1 1.51 ▇▃▃▁▁▁▁▁
integer discharge_disposition_id 0 71518 NA 3.59 5.27 ▇▂▁▁▁▁▁▁
integer encounter_id 0 71518 NA 1.6e+08 1e+08 ▅▇▇▅▃▂▁▁
integer num_lab_procedures 0 71518 NA 43.08 19.95 ▃▃▇▆▂▁▁▁
integer num_medications 0 71518 NA 15.71 8.31 ▆▇▂▁▁▁▁▁
integer num_procedures 0 71518 NA 1.43 1.76 ▇▃▂▂▁▁▁▁
integer number_diagnoses 0 71518 NA 7.25 1.99 ▁▂▅▃▇▁▁▁
integer number_emergency 0 71518 NA 0.1 0.51 ▇▁▁▁▁▁▁▁
integer number_inpatient 0 71518 NA 0.18 0.6 ▇▁▁▁▁▁▁▁
integer number_outpatient 0 71518 NA 0.28 1.07 ▇▁▁▁▁▁▁▁
integer patient_nbr 0 71518 NA 5.5e+07 3.9e+07 ▇▇▅▆▅▁▁▁
integer time_in_hospital 0 71518 NA 4.29 2.95 ▇▇▂▃▂▁▁▁

Now, we will drop some of the columns for various reasons: - Encounter ID: an uneccessary column that provides no information for readmission - Patient Number: an unecessary column that provides no information for readmission - Weight: While potentially useful, there is too many missing values - Payer Code: too many missing values - Medical Specialty: probably unecessary and too many missing values - Diabetes Medication: redundant information - diag_2 and diag_3: ICD codes are hard to deal with and often lack predictive power, so we will try to use only primary diagnosis

drops <- c("encounter_id", "patient_nbr", "weight", "payer_code", 
           "medical_specialty", "diabetesMed", "diag_2", "diag_3")
data <- data[ , !(names(data) %in% drops)]

Next, we need to drop any patients who’s discharge status is listed as “expired”. If a patient has died, there is no chance of readmission.

data <- filter(data, !(discharge_disposition_id %in% c(11, 19, 20, 21)))

We will also remove the 3 subjects with an unknown gender.

data <- data %>% filter(gender != "Unknown/Invalid")
data$gender <- droplevels(data$gender)

Data Encodings

  • Gender will be encoded as 0 for Female, 1 for Male
data$gender <- as.integer(data$gender) - 1
  • Age is given as 10 year intervals. We will split this interval into the median age of the interval.
# Encoding ages
data$age <- ifelse(data$age == "[0-10)", 5, 
                  ifelse(data$age == "[10-20)", 15, 
                         ifelse(data$age == "[20-30)", 25, 
                                ifelse(data$age == "[30-40)", 35, 
                                       ifelse(data$age == "[40-50)", 45, 
                                              ifelse(data$age == "[50-60)", 55, 
                                                     ifelse(data$age == "[60-70)", 65, 
                                                            ifelse(data$age == "[70-80)", 75, 
                                                                   ifelse(data$age == "[80-90)", 85, 95)))))))))
  • Glucose Serum will be encoded as follows:
    • 0 for No measurment
    • 1 for Normal reading
    • 2 for >200 & >300 (abnormal readings)
data$max_glu_serum <- ifelse(data$max_glu_serum == "None", 0, 
                          ifelse(data$max_glu_serum == "Norm", 1, 2))
  • A1C test results will be encoded as follows:
    • 0 for No measurement
    • 1 for Normal reading
    • 2 for an abnormal reading (>7 or >8)
data$A1Cresult <- ifelse(data$A1Cresult == "None", 0, 
                         ifelse(data$A1Cresult == "Norm", 1, 
                                ifelse(data$A1Cresult %in% c(">7", ">8"), 2, NA)))
  • Changes in insulin doage will be encoded as follows:
    • 0 for no insulin
    • 1 for decrease in insulin
    • 2 for steady insulin
    • 3 for increase in insulin
data$insulin <- ifelse(data$insulin == "No", 0, ifelse(data$insulin == "Down", 1, ifelse(data$insulin == "Steady", 2, 3)))
  • Change in Medincations will be encoded as follows:
    • 0 for no change in medication
    • 1 for change in medication
data$change <- ifelse(data$change == "Ch", 1, 0)
  • Race will be encoded as follows:
    • 0 for Caucasian
    • 1 for African American
    • 2 for Other
data$race <- ifelse(data$race == "Caucasian", 0, ifelse(data$race == "AfricanAmerican", 1, 2))
data$race[is.na(data$race)] <- 2
  • Readmitted Status will be encoded as follows:
    • 0 for No Readmission within 30 days
    • 1 for a Readmission in <30 days
data$readmitted <- ifelse((data$readmitted == "NO" | data$readmitted == ">30"), 0, 1)
data %>% group_by(readmitted) %>% summarize(count=n())
## # A tibble: 2 x 2
##   readmitted count
##        <dbl> <int>
## 1          0 64138
## 2          1  6293

We see that we have a highly imbalanced class problem.

New Features

  • Healthcare Utilization is calculated as the sum of number of outpatient, emergency, and inpatient encouters the patient has had in the past year
# Sum of number of outpatient, emergency and inpatient encounters 
data$healthcare_utilization <- (data$number_outpatient + data$number_emergency + data$number_inpatient)
  • Sum on non-insulin diabetes medications
# Creating sum of other meds on column
# (not all of these columns have all four options, but it was easier to code this way - gives same result)
data$metformin <- ifelse(data$metformin %in% c("Up", "Steady", "Down"),1,0)
data$repaglinide <- ifelse(data$repaglinide %in% c("Up", "Steady", "Down"),1,0)
data$nateglinide <- ifelse(data$nateglinide %in% c("Up", "Steady", "Down"),1,0)
data$chlorpropamide <- ifelse(data$chlorpropamide %in% c("Up", "Steady", "Down"),1,0)
data$glimepiride <- ifelse(data$glimepiride %in% c("Up", "Steady", "Down"),1,0)
data$acetohexamide <- ifelse(data$acetohexamide %in% c("Up", "Steady", "Down"),1,0)
data$glipizide <- ifelse(data$glipizide %in% c("Up", "Steady", "Down"),1,0)
data$glyburide <- ifelse(data$glyburide %in% c("Up", "Steady", "Down"),1,0)
data$tolbutamide <- ifelse(data$tolbutamide %in% c("Up", "Steady", "Down"),1,0)
data$pioglitazone <- ifelse(data$pioglitazone %in% c("Up", "Steady", "Down"),1,0)
data$rosiglitazone <- ifelse(data$rosiglitazone %in% c("Up", "Steady", "Down"),1,0)
data$acarbose <- ifelse(data$acarbose %in% c("Up", "Steady", "Down"),1,0)
data$miglitol <- ifelse(data$miglitol %in% c("Up", "Steady", "Down"),1,0)
data$troglitazone <- ifelse(data$troglitazone %in% c("Up", "Steady", "Down"),1,0)
data$tolazamide <- ifelse(data$tolazamide %in% c("Up", "Steady", "Down"),1,0)
data$examide <- ifelse(data$examide %in% c("Up", "Steady", "Down"),1,0)
data$citoglipton <- ifelse(data$citoglipton %in% c("Up", "Steady", "Down"),1,0)
data$glyburide.metformin <- ifelse(data$glyburide.metformin %in% c("Up", "Steady", "Down"),1,0)
data$glipizide.metformin <- ifelse(data$glipizide.metformin %in% c("Up", "Steady", "Down"),1,0)
data$glimepiride.pioglitazone <- ifelse(data$glimepiride.pioglitazone %in% c("Up", "Steady", "Down"),1,0)
data$metformin.rosiglitazone <- ifelse(data$metformin.rosiglitazone %in% c("Up", "Steady", "Down"),1,0)
data$metformin.pioglitazone <- ifelse(data$metformin.pioglitazone %in% c("Up", "Steady", "Down"),1,0)

data$num_other_diabetes_meds_up_stdy_dwn <- (data$metformin + data$repaglinide + data$nateglinide + data$chlorpropamide + data$glimepiride + data$acetohexamide + data$glipizide + data$glyburide + data$tolbutamide + data$pioglitazone + data$rosiglitazone + data$acarbose + data$miglitol + data$troglitazone + data$tolazamide + data$examide + data$citoglipton + data$glyburide.metformin + data$glipizide.metformin + data$glimepiride.pioglitazone + data$metformin.rosiglitazone + data$metformin.pioglitazone)


drops2 <- c("metformin", "repaglinide", "nateglinide", "chlorpropamide", "glimepiride", "acetohexamide", "glipizide", 
            "glyburide", "tolbutamide", "pioglitazone", "rosiglitazone", "acarbose", "miglitol", "troglitazone", 
            "tolazamide", "examide", "citoglipton", "glyburide.metformin", "glipizide.metformin", "glimepiride.pioglitazone",
            "metformin.rosiglitazone", "metformin.pioglitazone")

data <- data[ , !(names(data) %in% drops2)]

One Hot Encodings

  • Admission types, discharge dispositions, admission source will be grouped and one hot encoded
# set up dummy variables for admission types
data <- mutate(data, at_emergent = as.numeric(admission_type_id %in% c(1, 2, 7)), 
              at_elective = as.numeric(admission_type_id == 3), 
              at_other = as.numeric(admission_type_id %in% c(4, 5, 6, 8)))

# Set up dummy variables for discharge dispositions
data <- mutate(data, dd_home = as.numeric(discharge_disposition_id %in% c(1, 6, 8)), 
              dd_facility_transfer = as.numeric(discharge_disposition_id %in% c(2, 3, 4, 5, 10, 16, 17, 22, 23, 24, 30, 27, 28, 29, 13, 14)), 
              dd_other = as.numeric(discharge_disposition_id %in% c(7, 18, 25, 26, 9, 12, 15))) 
              # dd_admitted = as.numeric(discharge_disposition_id %in% c(9, 12, 15)), #lumping admitted into other
              # add_expired = as.numeric(discharge_disposition_id %in% c(11, 19, 20, 21)), (dropped patients who expired)
              # dd_hospice = as.numeric(discharge_disposition_id %in% c(13, 14))) #lumping hospice into transfer

# Set up dummy variables for admission source
data <- mutate(data, as_outpatient = as.numeric(admission_source_id %in% c(1, 2, 3)),
              as_facility_transfer = as.numeric(admission_source_id %in% c(4, 5, 6, 10, 18, 19, 22, 25, 26)),
              as_ed = as.numeric(admission_source_id %in% c(7)),
              as_other = as.numeric(admission_source_id %in% c(8, 9, 15, 17, 20, 21, 11, 12, 13, 14, 23, 24)))
              #as_newborn = as.numeric(admission_source_id %in% c(11, 12, 13, 14, 23, 24)))#as_newborn added to as_other

data <- within(data, rm(admission_type_id))
data <- within(data, rm(discharge_disposition_id))
data <- within(data, rm(admission_source_id))
  • ICD codes have over 700 different “levels” in our dataframe. We will group these ICD codes into 20 different categories based on the Strack et al. 2014 paper. We will then one hot encode these categories for each patient.
data$diag_1 <- as.character(data$diag_1)

data$diag_1grp <- ifelse (data$diag_1 == "?", "Unknown", 
          ifelse(grepl(data$diag_1, pattern = "[EV]") == T, "External", 
            ifelse(floor(as.numeric(data$diag_1)) == 250, "Circulatory", 
              ifelse(data$diag_1 %in% c(390:459, 785), "Diabetes", 
                ifelse(data$diag_1 %in% c(460:519, 786), "Respiratory", 
                  ifelse(data$diag_1 %in% c(520:579, 787), "Digestive", 
                    ifelse(data$diag_1 %in% c(800:999), "Injury", 
                      ifelse(data$diag_1 %in% c(710:739), "Musculoskeletal",
                        ifelse(data$diag_1 %in% c(580:629, 788), "Genitourinary", 
                         ifelse(data$diag_1 %in% c(140:239), "Neoplasm", 
                           ifelse(data$diag_1 %in% c(780, 781, 784, 790:799, 740:759), "Other", # added congenital to other
                             ifelse(data$diag_1 %in% c(240:249, 251:279), "Endocrine_Nutrition_Metabolic",  
                               ifelse(data$diag_1 %in% c(680:709, 782), "Skin", 
                                 ifelse(data$diag_1 %in% c(001:139), "Infectious",
                                   ifelse(data$diag_1 %in% c(290:319), "Mental", 
                                     ifelse(data$diag_1 %in% c(280:289), "Blood",
                                       ifelse(data$diag_1 %in% c(320:359), "Nervous",
                                         ifelse(data$diag_1 %in% c(630:679), "Pregnancy", 
                                          ifelse(data$diag_1 %in% c(360:389), "Sense", "Unknown") 
                                            #ifelse(data$diag_1 %in% c(740:759), "Congenital", "NULL")
                                ))))))))))))))))))
## Warning in ifelse(floor(as.numeric(data$diag_1)) == 250, "Circulatory", :
## NAs introduced by coercion
data$diag_1grp <- as.factor(data$diag_1grp)

data <- one_hot(data.table(data))
data <- as.data.frame(data) # convert back to data.frame format

data <- within(data, rm(diag_1))
data <- na.omit(data)

Removing Highly Correlated Variables

We will take a look at the correlation matrix of our dataset, and remove any highly correlated variables (correlation greater than abs(0.6)).

data <- data %>% select(-readmitted, readmitted) # move readmitted to the last column value
cor_mat <- cor(data[, 1:46])
cor_mat[!lower.tri(cor_mat)] <- 0
data <- data[,!apply(cor_mat,2,function(x) any(abs(x) > 0.6))]

cor_mat2 <- cor(data[, 1:41])

get_lower_tri<-function(cormat){
    cormat[upper.tri(cormat)] <- NA
    return(cormat)
}

cor_mat_melted <- melt(get_lower_tri(cor_mat2), na.rm = T)

ggplot(data = cor_mat_melted, aes(x=Var1, y=Var2, fill=value)) + 
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                       midpoint = 0, limit = c(-1,1), space = "Lab") +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = -45, vjust = 1, hjust = 0),
        legend.position = c(0,1), legend.justification = c(0,1),
        axis.title = element_blank(), plot.title = element_text(hjust = 0.5)) +
  labs(title = "Correlation matrix") +
  scale_y_discrete(position = "right")

#corrplot(cor_mat2, type = "upper", order = "hclust", tl.col = "black", tl.srt = 45)

Train Test Split

The dataset is now ready, and we will split it into training(80%) and test(20%) datasets using caret’s createDataPartition function.

# write cleaned dataset
write.csv(data, file = "dataset/clean_diabetic_data.csv", row.names = F)

# Create the training and test datasets
set.seed(123)

# Step 1: Get row numbers for the training data (80% train split, 20% testing split)
trainRowNumbers <- createDataPartition(data$readmitted, p=0.8, list=FALSE)

# Step 2: Create the training  dataset
trainData <- data[trainRowNumbers,]

# Step 3: Create the test dataset
testData <- data[-trainRowNumbers,]

# Step 4: Save the datasets 
write.csv(trainData, file = "dataset/trainData.csv", row.names = F)
write.csv(testData, file = "dataset/testData.csv", row.names = F)

# recode the target variable as a factor
trainData$readmitted <- as.factor(trainData$readmitted)
testData$readmitted <- as.factor(testData$readmitted)

Exploratory Data Analysis

PCA and t-SNE

To get us started, we will try some dimensionality reduction techniques to see if there is any separation between our two classes. We start with principal component analysis (PCA) which uses an orthogonal transformation to convert the original variables into a set of linear combinations of the original variables, called principal components. PCA is commonly used as a tool in exploratory data analysis.

We also use the popular T-distributed Stochastic Neighbor Embedding (t-SNE) algorithm, which is a machine learning visualization tool. It is a non-linear dimensionality reduction technique that models each high-dimensional observation by a two-dimensional point in such a way that similar observations are modeled by nearby points with high probability.

pca_tsne_df <- data[,c(1:41)]
# scale variables
for(i in 1:ncol(pca_tsne_df)){
  pca_tsne_df[[i]] <- ((pca_tsne_df[[i]] - mean(pca_tsne_df[[i]])) / sd(pca_tsne_df[[i]]))
}

pca_tsne_df <- t(pca_tsne_df)

#PCA

PC <- prcomp(pca_tsne_df)

var_explained <- data.frame("PC" = c(1:41), "eigenvalue" = (PC$sdev)^2)
var_explained$var_explained <- var_explained$eigenvalue / sum(var_explained$eigenvalue)
var_explained$cumulative <- cumsum(var_explained$var_explained) / sum(var_explained$var_explained)

# ggplot(var_explained, aes(x = PC, y = var_explained)) + 
#   geom_line() +
#   labs(title = "Elbow plot: proportion of variance explained by each PC")
# 
# ggplot(var_explained, aes(x = PC, y = cumulative)) + 
#   geom_line() +
#   labs(title = "Cumulative variance explained by PC_1 through PC_n",
#        x = "n", y = "var_explained")

p1 <- ggplot(var_explained, aes(x = PC, y = var_explained)) + 
  geom_line() +
  labs(title = "Elbow plot: proportion of variance \nexplained by each PC") +
  theme(panel.grid.minor = element_blank())
p2 <- ggplot(var_explained, aes(x = PC, y = cumulative)) + 
  geom_line() + 
  labs(title = "Cumulative proportion of variance \nexplained by PC_1 through PC_n",
       x = "PC", y = "variance explained") +
  theme(axis.title.y = element_blank(), panel.grid.minor = element_blank())

ggsave(grid.arrange(p1, p2, ncol = 2), filename = "docs/images/pcastats_grid.png", width = 9, height = 4)

outdf <- cbind(data, PC$rotation)
outdf$readmitted <- as.factor(outdf$readmitted)

ggplot(outdf, aes(x = PC1, y = PC2)) +
  geom_point(aes(color = readmitted), alpha = 0.5) +
  scale_color_manual(values = c("dodgerblue", "firebrick1")) +
  labs(title = "Classes are not well-separated by principal components")

# TSNE
tsne_model_1 = Rtsne(t(as.matrix(pca_tsne_df)), check_duplicates=FALSE, pca=TRUE,perplexity = 10, theta=0.5, dims=2)

## getting the two dimension matrix
d_tsne_1 = as.data.frame(tsne_model_1$Y) 

d_tsne_1$readmitted <- as.factor(data$readmitted)

## plotting the results without clustering
ggplot(d_tsne_1, aes(x=V1, y=V2)) +  
  geom_point(aes(color = readmitted)) +
  scale_color_manual(values = c("dodgerblue", "firebrick1")) +
  labs(title = "Classes are not well-separated by t-SNE")

We do not see any class separation of readmitted classes by PCA or t-SNE analysis.

Grid of plots for website

# script for making grid of plots for website

p1 <- ggplot(data=data, aes(x=factor(readmitted, labels = c("No Readmission \nwithin 30 Days", "Readmission \nin < 30 Days")), y=num_lab_procedures)) + 
  geom_boxplot(fill="firebrick1", alpha=0.6) + 
  labs(y="Number of Lab Procedures", x="Readmission Status", title="Number of Lab Procedures", fill="Readmission Status") +
  theme(axis.title.x = element_blank(), plot.margin = margin(0.5,0.5,0.5,0.5,"cm"))

p2 <- ggplot(data=data, aes(x=factor(readmitted, labels = c("No Readmission \nwithin 30 Days", "Readmission \nin < 30 Days")), y=num_procedures)) + 
  geom_boxplot(fill="firebrick1", alpha=0.6) + 
  labs(y="Number of Non-Lab Procedures", x="Readmission Status", title="Number of Non-Lab Procedures", fill="Readmission Status") +
  theme(axis.title.x = element_blank(), plot.margin = margin(0.5,0.5,0.5,0.5,"cm"))

p3 <- ggplot(data=data, aes(x=factor(readmitted, labels = c("No Readmission \nwithin 30 Days", "Readmission \nin < 30 Days")), y=num_medications)) + 
  geom_boxplot(fill="firebrick1", alpha=0.6) + 
  labs(y="Number of Medications Administered", x="Readmission Status", title="Number of Medications Administered", fill="Readmission Status") +
  theme(axis.title.x = element_blank(), plot.margin = margin(0.5,0.5,0.5,0.5,"cm"))

p4 <- ggplot(data=data, aes(x=factor(readmitted, labels = c("No Readmission \nwithin 30 Days", "Readmission \nin < 30 Days")), y=healthcare_utilization)) + 
  geom_boxplot(fill="firebrick1", alpha=0.6) + 
  labs(y="Healthcare Utilization", x="Readmission Status", title="Healtchare Utilization", fill="Readmission Status") +
  theme(axis.title.x = element_blank(), plot.margin = margin(0.5,0.5,0.5,0.5,"cm"))

p5 <- ggplot(data=data, aes(x=factor(readmitted, labels = c("No Readmission \nwithin 30 Days", "Readmission \nin < 30 Days")), y=num_other_diabetes_meds_up_stdy_dwn)) + 
  geom_boxplot(fill="firebrick1", alpha=0.6) + 
  labs(y="Number of Diabetes Medications Changed", x="Readmission Status", title="Number of Diabetes Medications Changed", fill="Readmission Status") + 
  theme(axis.title.x = element_blank(), plot.margin = margin(0.5,0.5,0.5,0.5,"cm"))

p6 <- ggplot(data=data, aes(x=factor(readmitted, labels = c("No Readmission \nwithin 30 Days", "Readmission \nin < 30 Days")), y=number_diagnoses)) + 
  geom_boxplot(fill="firebrick1", alpha=0.6) + 
  labs(y="# of Diagnoses", x="Readmission Status", title="Number of Diagnoses", fill="Readmission Status") +
  theme(axis.title.x = element_blank(), plot.margin = margin(0.5,0.5,0.5,0.5,"cm"))

ggsave(grid.arrange(p1, p2, p3, p4, p5, p6, ncol=2), filename = "docs/images/feature_grid.png", width = 8.5, height = 11)

Composition Graphs

ggplot(data=data, aes(x=factor(race, labels = c("Caucasian", "African American", "Other")), 
                              fill=factor(readmitted, 
                                          labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) +
  geom_bar(position=position_dodge(), color = "black") +
  labs(title="Categorywise Bar Chart of Race", x="Race", y="Count", fill="Readmission Status") 

ggplot(data=data, aes(x=factor(gender, labels = c("Female", "Male")), 
                              fill=factor(readmitted, 
                                          labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) +
  geom_bar(position=position_dodge(), color = "black") +
  labs(title="Categorywise Bar Chart of Gender", x="Gender", y="Count", fill="Readmission Status") 

ggplot(data=data, aes(x=factor(A1Cresult, labels = c("No Measurement", "Normal Reading", "Abnormal")), 
                              fill=factor(readmitted, 
                                          labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) +
  geom_bar(position=position_dodge(), color = "black") +
  labs(title="Categorywise Bar Chart of A1C Test Result", x="A1C", y="Count", fill="Readmission Status") 

ggplot(data=data, aes(x=factor(insulin, labels = c("No Insulin", "Decrease in Insulin", "Steady Insulin", "Increase in Insulin")), 
                              fill=factor(readmitted, 
                                          labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) +
  geom_bar(position=position_dodge(), color = "black") +
  labs(title="Categorywise Bar Chart of Insulin Change", x="Insulin Change", y="Count", fill="Readmission Status")

ggplot(data=data, aes(x=factor(change, labels = c("No Change", "Change in Medication")), 
                              fill=factor(readmitted, 
                                          labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) +
  geom_bar(position=position_dodge(), color = "black") +
  labs(title="Categorywise Bar Chart of Medication Change", x="Medication Change", y="Count", fill="Readmission Status")

Distribution Graphs

ggplot(data=data, aes(x=time_in_hospital, 
                     fill=factor(readmitted, 
                                 labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) + 
  geom_density(alpha=0.5) + 
  labs(x="Time in Hospital", y="Density", 
       title="Distribution of Time in Hospital Grouped by Readmission Status", fill="Readmission Status")

ggplot(data=data, aes(x=factor(readmitted, 
                                 labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")), 
                     y=time_in_hospital)) + 
  geom_boxplot(fill="firebrick1", alpha=0.6) + 
  labs(y="Time in Hospital", x="Readmission Status", 
       title="Distribution of Time in Hospital Grouped by Readmission Status", fill="Readmission Status")

ggplot(data=data, aes(x=num_lab_procedures, 
                     fill=factor(readmitted, 
                                 labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) + 
  geom_density(alpha=0.5) + 
  labs(x="# of Lab Proedures", y="Density", 
       title="Distribution of # of Lab Procedures Grouped by Readmission Status", fill="Readmission Status")

ggplot(data=data, aes(x=factor(readmitted, 
                                 labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")), 
                     y=num_lab_procedures)) + 
  geom_boxplot(fill="firebrick1", alpha=0.6) + 
  labs(y="# of Lab Procedures", x="Readmission Status", 
       title="Distribution of # of Lab Procedures Grouped by Readmission Status", fill="Readmission Status")

ggplot(data=data, aes(x=num_procedures, 
                     fill=factor(readmitted, 
                                 labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) + 
  geom_density(alpha=0.5) + 
  labs(x="# of Non-Lab Proedures", y="Density", 
       title="Distribution of # of Non-Lab Procedures Grouped by Readmission Status", fill="Readmission Status")

ggplot(data=data, aes(x=factor(readmitted, 
                                 labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")), 
                     y=num_procedures)) + 
  geom_boxplot(fill="firebrick1", alpha=0.6) + 
  labs(y="# of Non-Lab Procedures", x="Readmission Status", 
       title="Distribution of # of Non-Lab Procedures Grouped by Readmission Status", fill="Readmission Status")

ggplot(data=data, aes(x=num_medications, 
                     fill=factor(readmitted, 
                                 labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) + 
  geom_density(alpha=0.5) + 
  labs(x="# of Medications Administered", y="Density", 
       title="Distribution of # of Medications Administered Grouped by Readmission Status", fill="Readmission Status")

ggplot(data=data, aes(x=factor(readmitted, 
                                 labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")), 
                     y=num_medications)) + 
  geom_boxplot(fill="firebrick1", alpha=0.6) + 
  labs(y="# of Medications Administered", x="Readmission Status", 
       title="Distribution of # Medications Administered Grouped by Readmission Status", fill="Readmission Status")

ggplot(data=data, aes(x=number_diagnoses, 
                     fill=factor(readmitted, 
                                 labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) + 
  geom_density(alpha=0.5) + 
  labs(x="# of Diagnoses", y="Density", 
       title="Distribution of # of Diagnoses Grouped by Readmission Status", fill="Readmission Status")

ggplot(data=data, aes(x=factor(readmitted, 
                                 labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")), 
                     y=number_diagnoses)) + 
  geom_boxplot(fill="firebrick1", alpha=0.6) + 
  labs(y="# of Diagnoses", x="Readmission Status", 
       title="Distribution of # of Diagnoses Grouped by Readmission Status", fill="Readmission Status")

ggplot(data=data, aes(x=age, 
                     fill=factor(readmitted, 
                                 labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) + 
  geom_density(alpha=0.5) + 
  labs(x="End of Age Interval", y="Density", 
       title="Distribution of End Age of Interval Grouped by Readmission Status", fill="Readmission Status")

ggplot(data=data, aes(x=factor(readmitted, 
                                 labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")), 
                     y=age)) + 
  geom_boxplot(fill="firebrick1", alpha=0.6) + 
  labs(y="End of Age Interval", x="Readmission Status", 
       title="Distribution of End Age of Interval Grouped by Readmission Status", fill="Readmission Status")

ggplot(data=data, aes(x=healthcare_utilization, 
                     fill=factor(readmitted, 
                                 labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) + 
  geom_density(alpha=0.5) + 
  labs(x="Healtchare Utilization", y="Density", 
       title="Distribution of Healthcare Utilization by Readmission Status", fill="Readmission Status")

ggplot(data=data, aes(x=factor(readmitted, 
                                 labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")), 
                     y=healthcare_utilization)) + 
  geom_boxplot(fill="firebrick1", alpha=0.6) + 
  labs(y="Healthcare Utilization", x="Readmission Status", 
       title="Distribution of Healtchare Utilization Grouped by Readmission Status", fill="Readmission Status")

ggplot(data=data, aes(x=num_other_diabetes_meds_up_stdy_dwn, 
                     fill=factor(readmitted, 
                                 labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")))) + 
  geom_density(alpha=0.5) + 
  labs(x="# of Diabetes Medications Changed", y="Density", 
       title="Distribution of # of Diabetes Medications Changed Grouped by Readmission Status", fill="Readmission Status")

ggplot(data=data, aes(x=factor(readmitted, 
                                 labels = c("No Readmission within 30 Days", "Readmission in < 30 Days")), 
                     y=num_other_diabetes_meds_up_stdy_dwn)) + 
  geom_boxplot(fill="firebrick1", alpha=0.6) + 
  labs(y="# of Diabetes Medications Changed", x="Readmission Status", 
       title="Distribution of # of Diabetes Medications Changed Grouped by Readmission Status", fill="Readmission Status")

Statistical Analysis

Imbalanced Class Problem

We saw earlier that we have a large imbalanced class problem. Approximately 90% of our observations did not have a hosptial readmission within 30 days. This is about what we would expect given the problem domain. Most people will not be readmitted to the hospital, but a small subset of people will. Because of this, it is important to properly adjust the metrics and methods used to achieve our goal of predicting hosptial readmission.

Typically in a classification problem, one cares most about achieving accuracy. However, given that 90% of our points belong to one class, we could achieve 90% accuracy by just predicting the majority class every time. This isn’t useful though because we are most interested in identifying instances of the miniority class, those who will have a hosptial readmission. In this case, it is more important to optimize for a metric like recall, precision, F1 score, and AUC. We will use these metrics, namely AUC, to compare our models.

SMOTE Algorithm For Unbalanced Classification Problems

One way to help improve the training process in an imbalanced class problem is to try to rebalance them by oeither versampling instances of the minority class or undersampling instances of the majority class. This should, in theory, lead to classifiers which are not biased torwards one class or another. One popular techinque for doing this is the [Synthetic Minority Over-sampling Technique (SMOTE)] (https://arxiv.org/pdf/1106.1813.pdf).

SMOTE is a method which creates new instances of the minority class by forming convex combinations of neighboring instances. It effectively draws lines between minority points in the feature space, and samples along these lines. This allows us to balance our data-set without as much overfitting, as we create new synthetic examples rather than using duplicates. Regardless, there is still the potential for overfitting given that we are still creating data from existing data points.

if(file.exists("dataset/trainSMOTE.csv")) {
  trainData_SMOTE <- read.csv("dataset/trainSMOTE.csv")
} else {
  trainData$readmitted <- as.factor(trainData$readmitted)
  trainData_SMOTE <- SMOTE(readmitted ~ ., trainData) 
  write.csv(trainData_SMOTE, "dataset/trainSMOTE.csv", row.names = F)
}

Machine Learning Classifiers

Our problem is a classic example of a two-way classification supervised learning problem. We will use three different machine learning models to try to predict hospital readmission:

1. XGBoost 
2. KNN
3. GLMnet

XGBoost

XGBoost stands for eXtreme Gradient Boosting. As indicated by the name, it is a boosting method that combines the output of many different Classification And Regression Trees (CART). CARTs are like decision trees except that in addition to simply producing a classification they attach a score to each individual leaf. When we combine several trees the values attached to a particular observation are added together. XGBoost’s classification and regression outputs are based upon the output of the sum of these values. Training consists of selecting a new tree to add to the ensemble based upon which one best optimizes the loss function. Hence, why it is called gradient boosting; we are adding the trees that take a step in the best direction for optimizing the objective function.

# create XGB Boost dense matrix objects --> faster performance 
dtrain <- xgb.DMatrix(data = as.matrix(trainData_SMOTE[, 1:41]), label = as.numeric(trainData_SMOTE$readmitted))
dtest <- xgb.DMatrix(data = as.matrix(testData[, 1:41]), label = as.numeric(testData$readmitted))

Hyperparameter optimization

In machine learning, hyperparameter optimization is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is set before the learning process begins. By contrast, the values of other parameters are derived via training.

if(file.exists("dataset/xgb_params.csv")) {
    best_params <- read.csv("dataset/xgb_params.csv")
} else {
  param_grid <- expand.grid(objective = "binary:logistic",
                          eval_metric = "auc", 
                          eta = c(0.1, 0.3, 0.5), 
                          max_depth = c(2,4,6),
                          gamma = c(2,3,4), 
                          scale_pos_weight = c(2, 3))

  best_test_auc <- 0
  best_params <- param_grid[1, ]
   
  for (i in 1:nrow(param_grid)) {
    print(i)
    xgb_cv <- xgb.cv(params = param_grid[i, ], data = dtrain, nrounds = 150, 
                     nfold = 10, early_stopping_rounds = 5, verbose = F)
    if(xgb_cv$evaluation_log[nrow(xgb_cv$evaluation_log), test_auc_mean] > best_test_auc) {
      best_test_auc = xgb_cv$evaluation_log[nrow(xgb_cv$evaluation_log), test_auc_mean] 
      best_params <- param_grid[i, ]
    } 
  }
  write.csv(best_params, "dataset/xgb_params.csv", row.names = F)
}

Training the Model

xgb <- xgb.train(params = best_params, data = dtrain, nrounds = 500)
xgbPredict <- predict(xgb, dtest)
xgbPredict <- as.numeric(xgbPredict> 0.5)
confusionMatrix(reference = as.factor(testData$readmitted), data = as.factor(xgbPredict), mode = "everything", positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 9761  753
##          1 3033  537
##                                           
##                Accuracy : 0.7312          
##                  95% CI : (0.7238, 0.7385)
##     No Information Rate : 0.9084          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0999          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.41628         
##             Specificity : 0.76294         
##          Pos Pred Value : 0.15042         
##          Neg Pred Value : 0.92838         
##               Precision : 0.15042         
##                  Recall : 0.41628         
##                      F1 : 0.22099         
##              Prevalence : 0.09159         
##          Detection Rate : 0.03813         
##    Detection Prevalence : 0.25348         
##       Balanced Accuracy : 0.58961         
##                                           
##        'Positive' Class : 1               
## 

We see from this first model that we are not performing very well. We have an accuracy of 73.1%, precision of 15.04%, recall of 41.6%, and a F1 score of 0.22.

Variable Importance According to XGBoost

importance <- xgb.importance(model = xgb)
xgb.ggplot.importance(importance_matrix = importance, top_n = 10) + 
  theme(axis.text = element_text(size = 12), 
        plot.title = element_text(size=14),
        panel.grid.minor = element_blank(), 
        panel.grid.major = element_line(size=0.25), 
        axis.title.y = element_blank())

According to the XGBoost algorithm, healthcare utilization, age, and time in hospital are the three most important factors for predicting hospital readmission.

Simple ROC Curve

pred <- prediction(xgbPredict, testData$readmitted)
AUC <- performance(pred, "auc")@y.values[[1]]
perf_ROC=performance(pred,"tpr","fpr") #plot the actual ROC curve
plot(perf_ROC, main="ROC plot")
text(0.9, 0.0,paste("AUC = ",format(AUC, digits=3, scientific=FALSE)))

We see that XGBoost produces a testing AUC-ROC of 0.59.

KNN

KNN is an algorithm that takes in numerical features, then calculates the distance between a given observation and all the other observations in the dataset, then chooses the predicted label for the given point based on what label the majority of that point’s “K nearest neighbors” (K points closest to the point of interest) possess. For this reason it is important to choose a \(K\) in such a way as to avoid tie votes among the “K neighbors.”

ctrl <- trainControl(method="cv", number = 10)
cl <- makePSOCKcluster(detectCores())
registerDoParallel(cl)
knn <- train(as.factor(readmitted) ~ ., data = trainData_SMOTE, method = "knn", metric = "kappa",
             trControl = ctrl, preProcess = c("center","scale"), tuneLength = 5)
## Warning in train.default(x, y, weights = w, ...): The metric "kappa" was
## not in the result set. Accuracy will be used instead.
stopCluster(cl)
plot(knn)

We see that even just testing KNN on 5 different k values that our performance is decreasing with increased values of k. Therefore, we choose not to test any more.

knnPredict <- predict(knn, testData)
confusionMatrix(reference = as.factor(testData$readmitted), data = as.factor(knnPredict), mode = "everything", positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 8111  704
##          1 4683  586
##                                           
##                Accuracy : 0.6175          
##                  95% CI : (0.6094, 0.6255)
##     No Information Rate : 0.9084          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.037           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.45426         
##             Specificity : 0.63397         
##          Pos Pred Value : 0.11122         
##          Neg Pred Value : 0.92014         
##               Precision : 0.11122         
##                  Recall : 0.45426         
##                      F1 : 0.17869         
##              Prevalence : 0.09159         
##          Detection Rate : 0.04161         
##    Detection Prevalence : 0.37411         
##       Balanced Accuracy : 0.54412         
##                                           
##        'Positive' Class : 1               
## 

We see from this model that we are still not performing very well. We have an accuracy of 61.78%, precision of 11.11%, recall of 45.42%, and a F1 score of 0.18.

Simple ROC Curve

knnPred <- prediction(as.numeric(knnPredict), testData$readmitted)
knn_AUC <- performance(knnPred, "auc")@y.values[[1]]
knn_perf_ROC=performance(knnPred,"tpr","fpr") #plot the actual ROC curve
plot(knn_perf_ROC, main="ROC plot")
text(0.9, 0.0,paste("AUC = ",format(knn_AUC, digits=3, scientific=FALSE)))

We see that KNN produces a testing AUC-ROC of 0.544.

GLMnet

GLMnet fits a generalized linear model via penalized maximum likelihood. The algorithm implemented in the package computes the regularization path for the elastic-net penalty over a grid of values for the regularization parameter \(\lambda\) (https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html#log). The elastic-net penalty is controlled by α , and bridges the gap between lasso (α=1, the default) and ridge (α=0). The tuning parameter λ controls the overall strength of the penalty. To reduce noise caused by data-splitting, we used 10-fold cross-validation for tuning parameter values using the caret package as described here (http://www.milanor.net/blog/cross-validation-for-predictive-analytics-using-r/).

Hyperparameter Optimization and Trainging of GLMnet

trainData_SMOTE$readmitted <- as.factor(trainData_SMOTE$readmitted)
#cv_splits <- createFolds(trainData_SMOTE$readmitted, k = 10, returnTrain = TRUE)

glmnet_grid <- expand.grid(alpha = c(0,  .1,  .2, .4, .6, .8, 1),
                           lambda = seq(.01, .2, length = 10))
glmnet_ctrl <- trainControl(method = "cv", number = 20)

cl <- makePSOCKcluster(detectCores())
registerDoParallel(cl)
glmnet_fit <- train(readmitted ~ ., data = trainData_SMOTE,
                    method = "glmnet",
                    family = "binomial",
                    preProcess = c("center", "scale"),
                    tuneGrid = glmnet_grid,
                    trControl = glmnet_ctrl)
stopCluster(cl)

# Best tuning parameters
glmnet_fit$bestTune
##   alpha lambda
## 1     0   0.01

The best tuning parameter chosen after 10-fold cross validation is \(\alpha = 0, \lambda = 0.01\). This means that we are effectively using a ridge regression.

trellis.par.set(caretTheme())
plot(glmnet_fit, scales = list(x = list(log = 2)))

pred_classes <- predict(glmnet_fit, newdata = testData, type = "raw")
confusionMatrix(data = as.factor(pred_classes), reference = as.factor(testData$readmitted), mode = "everything", positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 10247   825
##          1  2547   465
##                                           
##                Accuracy : 0.7606          
##                  95% CI : (0.7534, 0.7676)
##     No Information Rate : 0.9084          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1009          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.36047         
##             Specificity : 0.80092         
##          Pos Pred Value : 0.15438         
##          Neg Pred Value : 0.92549         
##               Precision : 0.15438         
##                  Recall : 0.36047         
##                      F1 : 0.21618         
##              Prevalence : 0.09159         
##          Detection Rate : 0.03302         
##    Detection Prevalence : 0.21386         
##       Balanced Accuracy : 0.58069         
##                                           
##        'Positive' Class : 1               
## 

Even in our last model, we are still not performing very well. We have an accuracy of 76.06%, precision of 15.4%, recall of 36.84%, and a F1 score of 0.21.

Model coefficients as selected by GLMnet

plot(varImp(glmnet_fit,scale=F), top = 10)

coef(glmnet_fit$finalModel, s = glmnet_fit$bestTune$lambda) 
## 42 x 1 sparse Matrix of class "dgCMatrix"
##                                                     1
## (Intercept)                             -0.2971526132
## race                                    -0.0764752202
## gender                                   0.0139492093
## age                                      0.0757599572
## time_in_hospital                         0.0392643694
## num_lab_procedures                       0.0736108053
## num_procedures                          -0.0572806512
## num_medications                          0.0472362874
## number_emergency                         0.0426201602
## number_inpatient                         0.2429572892
## number_diagnoses                         0.0726062034
## max_glu_serum                           -0.0104628308
## A1Cresult                               -0.0816749941
## insulin                                  0.0419668317
## change                                   0.0201164513
## healthcare_utilization                   0.0044873986
## num_other_diabetes_meds_up_stdy_dwn     -0.0252987930
## at_elective                             -0.0439580950
## dd_facility_transfer                     0.2723021578
## dd_other                                 0.0582020860
## as_facility_transfer                    -0.0557204345
## as_ed                                   -0.0395831948
## as_other                                -0.0443884520
## diag_1grp_Blood                         -0.0181245675
## diag_1grp_Circulatory                    0.0353631182
## diag_1grp_Diabetes                       0.0583229720
## diag_1grp_Digestive                      0.0058505936
## diag_1grp_Endocrine_Nutrition_Metabolic -0.0059598295
## diag_1grp_External                       0.0248698499
## diag_1grp_Genitourinary                 -0.0090824450
## diag_1grp_Infectious                    -0.0227158656
## diag_1grp_Injury                        -0.0021541673
## diag_1grp_Mental                         0.0196700772
## diag_1grp_Musculoskeletal               -0.0244841718
## diag_1grp_Neoplasm                       0.0003117428
## diag_1grp_Nervous                        0.0003644846
## diag_1grp_Other                         -0.0215781812
## diag_1grp_Pregnancy                     -0.0146363069
## diag_1grp_Respiratory                   -0.0593963844
## diag_1grp_Sense                         -0.0494156823
## diag_1grp_Skin                          -0.0137203141
## diag_1grp_Unknown                       -0.0177019384

The most important features for predicting hospital readmission according to GLMnet are whether of not a patient was transfered to another facility at dischagre, the number of inpatient visits in the past year for a patient, and their A1C test result.

Simple ROC Curve

glmPred <- prediction(as.numeric(pred_classes), testData$readmitted)
glm_AUC <- performance(glmPred, "auc")@y.values[[1]]
glm_perf_ROC=performance(glmPred,"tpr","fpr") #plot the actual ROC curve
plot(glm_perf_ROC, main="ROC plot")
text(0.9, 0.0,paste("AUC = ",format(glm_AUC, digits=3, scientific=FALSE)))

We see that GLMnet produces a testing AUC-ROC of 0.581.

Conclusion

ML.Model AUC Recall Precision F1.Score
XGBoost 0.590 0.416 0.150 0.221
KNN 0.544 0.454 0.111 0.179
GLMnet 0.581 0.360 0.154 0.216

As we can see, none of our models were very succesful at predicting hospital readmission. Based on the above metrics, it appears that XGBoost performs the best, but it is still not very good. We see that we only correctly predict 41.6% of the cases that will be readmitted, and only 15% of the patients predicted to be readmitted actually will be. In the end, it seems like the current features from medical claims data are not sufficient in predicting hospital readmission. This can be seen in our exploratory data analysis as well where there appeared to be no real separation between classes across different predictor variables.